METHOD AND SYSTEM FOR ASSOCIATING EVENTS 



Background of the Invention 

1. Technical Field

The present invention relates to a method, system, and computer program product for 
associating events. 

2. Related Art 

In a web environment, preventing potential errors and failures has been a major issue in
web system design. Unfortunately, preventing system errors and failures on a complex web 
system is very difficult. Accordingly, there is a need for an efficient and accurate method and 
system for preventing such errors and failures on such a complex web system. 

Summary of the Invention 

The present invention provides a method for associating events, comprising the steps of: 

providing an event dataset that includes a plurality of events occurring in each of N
successive time intervals, said N ≥ 3;

deducing from the event dataset a plurality of association rules, each association rule
E_K => E_L of the plurality of association rules expressing an association between events E_K
and E_L respectively occurring in two successive time intervals of the N time intervals, said
events E_K and E_L being in the event dataset;

generating a plurality of sequences of events, each sequence of the plurality of sequences
being generated from at least two sequentially ordered association rules of the plurality of 
association rules; 

forming a plurality of clusters from the plurality of sequences in accordance with a 
clustering algorithm, each cluster of the plurality of clusters including at least two sequences of 
the plurality of sequences; and 

creating S_c sequences of clusters from the plurality of clusters, said S_c ≥ 1, each sequence
of the S_c sequences including at least two clusters of the plurality of clusters.

The present invention provides a system for associating events, comprising:

means for providing an event dataset that includes a plurality of events occurring in each
of N successive time intervals, said N ≥ 3;

means for deducing, from the event dataset, a plurality of association rules, each
association rule E_K => E_L of the plurality of association rules expressing an association
between events E_K and E_L respectively occurring in two successive time intervals of the N
time intervals, said events E_K and E_L being in the event dataset;

means for generating a plurality of sequences of events, each sequence of the plurality of 
sequences being generated from at least two sequentially ordered association rules of the plurality 
of association rules; 

means for forming a plurality of clusters from the plurality of sequences in accordance 
with a clustering algorithm, each cluster of the plurality of clusters including at least two 
sequences of the plurality of sequences; and

means for creating S_c sequences of clusters from the plurality of clusters, said S_c ≥ 1, each
sequence of the S_c sequences including at least two clusters of the plurality of clusters.

The present invention provides a computer program product comprising a computer 
usable medium having a computer readable program embodied therein, said computer readable 
program adapted to access an event dataset that includes a plurality of events occurring in each of 
N successive time intervals, said computer readable program further adapted to execute a method
for associating events, said method comprising the steps of: 

deducing from the event dataset a plurality of association rules, each association rule
E_K => E_L of the plurality of association rules expressing an association between events E_K
and E_L respectively occurring in two successive time intervals of the N time intervals, said
events E_K and E_L being in the event dataset;

generating a plurality of sequences of events, each sequence of the plurality of sequences 
being generated from at least two sequentially ordered association rules of the plurality of 
association rules; 

forming a plurality of clusters from the plurality of sequences in accordance with a 
clustering algorithm, each cluster of the plurality of clusters including at least two sequences of 
the plurality of sequences; and 

creating S_c sequences of clusters from the plurality of clusters, said S_c ≥ 1, each sequence
of the S_c sequences including at least two clusters of the plurality of clusters.

The present invention provides an efficient and accurate method and system for
preventing errors and failures on a complex web system. 

Brief Description of the Drawings 

FIG. 1 illustrates an association rule, in accordance with embodiments of the present 
invention.

FIG. 2 depicts sequentially-ordered sets of event notifications, in accordance with
embodiments of the present invention. 

FIG. 3 depicts instances of cluster volumes at successive event notifications, in 
accordance with embodiments of the present invention. 

FIG. 4 depicts a modular software system for problem detection and error diagnostics for
an autonomic computer system, in accordance with embodiments of the present invention.

FIG. 5 depicts functional components of an autonomic computing system, in accordance 
with embodiments of the present invention. 

FIG. 6 is a table listing hardware and software for system components of an autonomic 
computing system, in accordance with embodiments of the present invention.

FIG. 7 is a table listing operations and error types of the system components of FIG. 6, in
accordance with embodiments of the present invention. 

FIG. 8 is a table listing vector element indexes for vectors describing executed operations 
and occurring errors, in accordance with embodiments of the present invention. 

FIG. 9 is a table listing events recorded in the logs of the autonomic computing system,
in accordance with embodiments of the present invention.

FIG. 10 is a table listing association rules deduced from the recorded events of FIG. 9, in 
accordance with embodiments of the present invention. 

FIG. 11 is a table listing two of the association rules in FIG. 10, in accordance with
embodiments of the present invention. 

FIG. 12 is a table listing a sequence of events derived from the association rules in FIG.
11, in accordance with embodiments of the present invention.

FIG. 13 is a table listing sequences of events derived from the association rules of FIG. 
10, in accordance with embodiments of the present invention. 

FIG. 14 is a table listing clusters of the sequences of events of FIG. 13, in accordance 
with embodiments of the present invention. 

FIG. 15 is a table listing sequences of the clusters of FIG. 14, in accordance with 
embodiments of the present invention. 

FIG. 16 is a flow chart depicting a method for associating events, in accordance with 
embodiments of the present invention. 

FIG. 17 is a block diagram of a computer system used for associating events, in
accordance with embodiments of the present invention.

Detailed Description of the Invention 

The present invention relates to a method, system, and computer program product for 
associating events, and using the association of events to: identify probable cause(s) of some of 
said event(s); and predict occurrences of future events. The detailed description of the present
invention comprises the sections of: Introduction To the Invention and General Formulation of
the Invention.

Introduction To the Invention 

FIG. 4 depicts a modular software system for problem detection and error diagnostics for 
an autonomic computer system, in accordance with embodiments of the present invention. In 
FIG. 4, the modular software system is managed by a Problem Detection Module 10 which 
interfaces with other modules comprising: a Self-Learning Module 12, a Self-Monitoring Module 
14, a Self-Knowledge Module 16, a Self-Diagnostic Module 18, and a Problem Resolution 
Module 20. The modular software system also includes a Knowledge Base 22 coupled to the 
Problem Detection Module 10, wherein the Knowledge Base 22 is a database that stores problem
reports and solution records. 

Each of the modules in FIG. 4 is associated with specific classes of events. Examples of 
"events" are occurrences of hardware and software errors in a computer system. Examples of 
such errors are the "Errors" listed in FIG. 7, and examples of the occurrences of said errors are 
listed in the "Events" column of FIG. 9. Examples of classes of events are the "clusters" listed in
FIG. 14. Note that FIGS. 7, 9, and 14 will be described infra in the context of an illustrated 
example. 

Embodiments of the present invention are directed to monitoring and predicting errors in 
the autonomic computing system in connection with events associated with the preceding 
modules. Accordingly, FIG. 5 depicts functional components of the autonomic computing
system, said functional components collectively being directed to collecting, organizing,
patterning, and analyzing data from which errors may be monitored and predicted, in accordance
with embodiments of the present invention. The functional components, which are coupled
together as shown in FIG. 5, include: computing nodes 31, an event logs database 32, a monitor
33, a pattern analyzer 34, a pattern/rules database 35, and a notifier 36. Events from different
nodes 31 are monitored and recorded in the event logs database 32. The pattern analyzer 34
extracts the event class sequences associated with system failures. If the monitor 33 matches a
current event class with a sequence leading to a failure, then the monitor 33 invokes the notifier 36
to take an appropriate action.

The pattern/rules database 35 stores association rules derived from data comprising 
events that occurred and the frequency of occurrence of said events. An association rule may be 
expressed in the form X=>Y, which means that if event X occurs in a first time interval Δt_1, then
event Y will occur in a next time interval Δt_2, wherein Δt_1 and Δt_2 are successive time
intervals, with a confidence (i.e., probability of occurrence) of c%. Thus, an association rule has
an associated confidence or probability of occurrence. Examples of association rules are shown 
in FIG. 10, described infra in conjunction with the illustrated example. An algorithm for 
generating association rules is shown in Table 2, as described infra in the General Formulation 
section. 

In an embodiment, the pattern analyzer 34 in FIG. 5 may perform the following tasks:
discover patterns of event and event class sequences; and identify patterns of events and event
classes leading to system failures. Thus, the pattern analyzer 34 may execute event predictions,
event sequence predictions, and event class predictions. 

As an example of using the present invention to monitor and predict errors in the 
autonomic computing system of FIGS. 4 and 5, consider the autonomic computer system to 
include the following hardware and software: CPU, BUS, MEMORY, HARD DISK, SCREEN,
KEYBOARD, HARDWARE 1, HARDWARE 2, ..., HARDWARE n, SOFTWARE A,
SOFTWARE B, ..., SOFTWARE Z. The users of the system start to notice several "system
failures" during their normal daily operations. These system failures especially affect the
MEMORY and the SOFTWARE B. A log of all the system events during the last 6 months is
provided, and the task is to build an automated diagnostic system that would allow finding the
root cause of these "system failures" and would predict such errors in the future. Aspects of this
example are depicted in FIGS. 6-15, in accordance with embodiments of the present invention. 

To solve the "system failures" problem, the following steps are performed: 

• STEP 1: Build a data model for events

• STEP 2: Extract temporal event association rules

• STEP 3: Extract event sequences

• STEP 4: Cluster events

• STEP 5: Create sequences of clusters

• STEP 6: Apply the cluster sequences to identify a probable cause of a system failure

• STEP 7: Apply the cluster sequences to predict system failures

STEP 1: Build a Data Model for Events

For this example, consider the hardware and software system components listed in FIG. 6. 
Also, consider only a small number of operations and error types for each hardware and software
component listed in FIG. 6. Said operations and error types for each system component of FIG. 6
are listed in FIG. 7.

Each system event in this example is represented as a vector. Each element of the vector 
corresponds to an operation or an error from FIG. 7, and can have the value 0 or 1 in accordance 
with the vector element indexes of FIG. 8. For example, the vector
(1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0) corresponds to the event where the CPU-Op1 and
SWB-Op2 operations are executed, and the MEM-Error1 error occurred, as may be seen from FIG. 8.
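
As an illustration only, the vector representation of STEP 1 may be sketched in Python as follows. The index map is hypothetical: the actual element indexes are those of FIG. 8, and the assignment of the three labels of the example to elements 1, 10, and 15 is an assumption made solely so that the sketch reproduces the example vector. The names VECTOR_LENGTH, ELEMENT_INDEX, and encode_event are likewise illustrative.

```python
# Hypothetical, 1-based index map; FIG. 8 defines the real indexes.
VECTOR_LENGTH = 20
ELEMENT_INDEX = {
    "CPU-Op1": 1,
    "SWB-Op2": 10,     # assumed position
    "MEM-Error1": 15,  # assumed position
}


def encode_event(labels):
    """Return a 0/1 tuple with a 1 for every operation or error in `labels`."""
    vector = [0] * VECTOR_LENGTH
    for label in labels:
        vector[ELEMENT_INDEX[label] - 1] = 1
    return tuple(vector)


event = encode_event(["CPU-Op1", "SWB-Op2", "MEM-Error1"])
```

Under these assumptions, `event` equals the example vector given above.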

STEP 2: Extract Temporal Event Association Rules 

The events recorded in the logs are divided into 6 groups (or buckets) corresponding to a
six-month total period of time as shown in FIG. 9. Also shown in FIG. 9 is the number of
occurrences of each listed event. There is an event group for each monthly period. For increased
precision, the period of time corresponding to each event group could be smaller than a month
(e.g., 1 day, 1 hour, 1 min, etc.). Alternatively, the period of time corresponding to each event
group could be larger than a month (e.g., three months, one year, etc.).

The association rules for each two consecutive intervals are extracted (i.e., deduced), for
example, by computing the probability that an event from the second monthly interval is
associated with an event from the first monthly interval. For example, this probability P(E1,E5)
for the event E1 (which occurred 200 times in the first month) and the event E5 (which occurred
150 times in the second month) may be computed as follows:

P(E1,E5) = (no. of occurrences of E5) / (no. of occurrences of E1) = 150/200 = 75%

This probability P(E1,E5) is considered valid only if the no. of occurrences of E5 ≤ the no. of
occurrences of E1. Otherwise, no association rule is extracted between E1 and E5. Also, if that
probability P(E1,E5) is low (e.g., less than 70%), then no association rule is extracted between the
two events. For example, P(E1,E12) = 10/200 = 5% is negligible and lower than the threshold
(e.g., 70%). Therefore, no association rule between E1 and E12 is extracted. FIG. 10 shows the
extracted association rules, as derived from FIG. 9.
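
The computation of STEP 2 may be sketched, by way of example only, as the following Python function. The function name extract_rule and its parameters are illustrative assumptions; the 70% threshold and the occurrence counts are those of the example above. Treating equal occurrence counts as valid is also an assumption of the sketch.

```python
def extract_rule(count_first, count_second, threshold=0.70):
    """Return the rule probability, or None if no rule is extracted.

    count_first  -- occurrences of the earlier event in its interval
    count_second -- occurrences of the later event in the next interval
    """
    # The probability is considered valid only if the later event does not
    # occur more often than the earlier one (equality allowed: assumption).
    if count_first <= 0 or count_second > count_first:
        return None
    probability = count_second / count_first
    # Probabilities below the threshold yield no association rule.
    return probability if probability >= threshold else None


p_e1_e5 = extract_rule(200, 150)   # E1 => E5
p_e1_e12 = extract_rule(200, 10)   # E1 => E12
```

With the counts of the example, `p_e1_e5` is 0.75 and `p_e1_e12` is None, i.e., no rule is extracted between E1 and E12.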

STEP 3: Extract Event Sequences 

In this step, the extracted association rules shown in FIG. 10 are utilized to form
sequences of events that occur with a significant probability. For example, from the two
association rules shown in FIG. 11, the sequence of events in FIG. 12 is generated, with its
probability computed as the product of the corresponding probabilities for E1 => E5 and
E5 => E6, under the assumption that the association rules E1 => E5 and E5 => E6 are
independent of each other. Similarly, the longest sequences of events with high probabilities
(e.g., greater than 55%) may be formed as shown in FIG. 13.
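
One possible way of chaining association rules into event sequences is sketched below in Python. This is a minimal sketch under stated assumptions, not the specific procedure used to produce FIG. 13: the function name extend_sequences is illustrative, the rules are assumed to be independent and free of cycles, chains are started only from events that are not the consequent of any rule, and the probability 0.80 used for E5 => E6 is a hypothetical value (only the 75% of E1 => E5 is stated in the text).

```python
def extend_sequences(rules, min_probability=0.55):
    """Chain rules a => b, b => c, ... into maximal event sequences.

    rules -- dict mapping (antecedent, consequent) to the rule probability.
    Returns {sequence: probability} for chains built from at least two rules
    whose probability product exceeds min_probability.
    """
    successors = {}
    for (a, b), p in rules.items():
        successors.setdefault(a, []).append((b, p))
    consequents = {b for (_, b) in rules}
    results = {}

    def grow(sequence, probability):
        extended = False
        for nxt, p in successors.get(sequence[-1], []):
            if probability * p > min_probability:
                extended = True
                grow(sequence + (nxt,), probability * p)
        # Keep only chains that cannot be extended and use two or more rules.
        if not extended and len(sequence) >= 3:
            results[sequence] = probability

    for (a, b), p in rules.items():
        if a not in consequents:   # start at events with no predecessor
            grow((a, b), p)
    return results


rules = {("E1", "E5"): 0.75, ("E5", "E6"): 0.80}  # 0.80 is hypothetical
sequences = extend_sequences(rules)
```

Here the only sequence produced is E1 -> E5 -> E6, whose probability is the product 0.75 × 0.80.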

STEP 4: Clustering Events

In this step, the events of each month are clustered using the following definition. A
"cluster" is the maximum set of events where the intra-distance d(Ek, Ep) between any two events
Ek and Ep in the cluster is less than a certain fixed distance d0. By applying this definition to the
present example with d as a simple Euclidean distance and d0 = 2, the clusters formed for each
interval (i.e., each month) are shown in FIG. 14.
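
The clustering definition above may be approximated, for illustration, by the following greedy Python sketch. It is an assumption of this sketch, not a statement of the clustering algorithm actually used for FIG. 14, that events are assigned in order to the first cluster whose every member is closer than d0; such a greedy pass does not guarantee maximal sets. The function names and the six-element event vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_events(events, d0=2.0):
    """Greedy grouping in which every pair inside a cluster is closer than d0.

    events -- dict mapping an event name to its 0/1 vector.
    """
    clusters = []
    for name, vector in events.items():
        for cluster in clusters:
            if all(euclidean(vector, events[m]) < d0 for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])   # no compatible cluster: start a new one
    return clusters


# Hypothetical event vectors (not taken from FIG. 9).
events = {
    "Ea": (1, 1, 0, 0, 0, 0),
    "Eb": (1, 1, 1, 0, 0, 0),
    "Ec": (0, 0, 0, 1, 1, 1),
    "Ed": (0, 0, 0, 0, 1, 1),
}
clusters = cluster_events(events)
```

With these hypothetical vectors, Ea and Eb fall into one cluster and Ec and Ed into another, since the distance between Ea and Ed is exactly 2 and therefore not less than d0.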

STEP 5: Create Sequences of Clusters

FIG. 15 depicts the clusters of FIG. 14 as sequenced, wherein the probability of a cluster 
sequence is the product of the corresponding event sequence probabilities, under the assumption 
that the event sequences are independent of each other. 

STEP 6: Apply the Cluster Sequences to Identify a Probable Cause of the System Failures 

Recall that the original problem was to find the probable cause of the recent class
of errors affecting the memory and the software B, which correspond to the events E19 and E20.
These two events (E19 and E20 in month 6) form the cluster C11, which is caused by C1 (in month
1). Therefore, a probable cause of the errors affecting the memory and the software B is the set of
events in the cluster C1; i.e., the execution of the operation SWA-Op2 by the software A.



STEP 7: Apply the Cluster Sequences To Predict System Failures 

By looking at the same cluster sequences from another perspective and noting that event 
E14 belongs to the cluster C10 whereas events E15 and E16 each belong to the cluster C12, it may be
concluded that if during the present month the event E14 occurs, then it can be predicted that the
events E15 and E16 will occur some time in the next month. Therefore, the same extracted cluster
sequence can be applied to predict other system failures in the future. Note that the future cannot
be predicted with certainty. Thus, an event predicted in accordance with the present invention has
an associated probability of occurrence.
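
The prediction of STEP 7 may be sketched in Python as a lookup over cluster sequences. The sketch is illustrative only: the names CLUSTERS, CLUSTER_SEQUENCES, and predict_next_events are assumptions, only the cluster members named in the text above are listed (the full clusters are those of FIG. 14), and the sequence from C10 to C12 is inferred from the text rather than copied from FIG. 15.

```python
# Only the members named in the STEP 7 text are listed here.
CLUSTERS = {"C10": ["E14"], "C12": ["E15", "E16"]}
# Cluster sequence inferred from the STEP 7 text (assumption).
CLUSTER_SEQUENCES = [("C10", "C12")]


def predict_next_events(observed_event, clusters=CLUSTERS,
                        sequences=CLUSTER_SEQUENCES):
    """Return the events expected in the interval after `observed_event`."""
    predicted = []
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            if observed_event in clusters.get(current, []):
                predicted.extend(clusters.get(following, []))
    return predicted


prediction = predict_next_events("E14")
```

Observing E14 thus yields the predicted events E15 and E16, each of which would in practice carry the probability associated with the cluster sequence.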

In consideration of the preceding example and discussion supra thereof in conjunction 
with FIGS. 6-15, as well as the discussion supra of FIGS. 4 and 5, the following discussion 
presents a flow chart in FIG. 16 which describes the method of associating events of the present
invention and a block diagram of a computer system for associating events, in accordance with
embodiments of the present invention.

FIG. 16 is a flow chart depicting steps 71-77 of a method for associating events, in 
accordance with embodiments of the present invention. 

Step 71 provides an event dataset that includes a plurality of events occurring in each of
N successive time intervals Δt_1, Δt_2, ..., Δt_{N-1}, Δt_N, subject to N ≥ 3. For n = 2, 3, ...,
N-1, Δt_n is disposed between Δt_{n-1} and Δt_{n+1} such that Δt_{n-1} occurs before Δt_{n+1}.
A "dataset" is defined herein as a collection of data in any known organizational structure and
format (e.g., flat file(s), table(s), a relational database, etc.). The N time intervals may be
contiguously sequenced. An earlier time interval and a later time interval are said to be
contiguously sequenced if the earliest time of the later time interval coincides with the latest
time of the earlier time interval. Alternatively, the N time intervals may not be contiguously
sequenced. An earlier time interval and a later time interval are said to not be contiguously
sequenced if the earliest time of the later time interval occurs after the latest time of the
earlier time interval (i.e., with a time gap).
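
The distinction between contiguously and non-contiguously sequenced time intervals may be illustrated by the following Python sketch, in which each interval is assumed to be represented as a (start, end) pair of numbers and the function name is hypothetical.

```python
def contiguously_sequenced(intervals):
    """True if every interval starts exactly where the previous one ends.

    intervals -- list of (start, end) pairs ordered in time.
    """
    return all(later[0] == earlier[1]
               for earlier, later in zip(intervals, intervals[1:]))


contiguous = contiguously_sequenced([(0, 1), (1, 2), (2, 3)])
with_gap = contiguously_sequenced([(0, 1), (2, 3), (3, 4)])
```

The first list has no gaps, whereas the second contains a time gap between its first two intervals.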

Step 72 deduces, from the event dataset provided in step 71, a plurality of association
rules. Each association rule of the form E_K => E_L expresses an association between events E_K
and E_L respectively occurring in successive time intervals of the N time intervals. The events
E_K and E_L are any two sequentially-ordered events in the event dataset provided in step 71.

If E_1 => E_2 and E_2 => E_3, then the association rules E_1 => E_2 and E_2 => E_3 are said to be
sequentially ordered. For each pair of sequentially ordered association rules represented by
E_1 => E_2 and E_2 => E_3, the association rules E_1 => E_2 and E_2 => E_3 may be independent of
each other. Alternatively, for at least one such pair of sequentially ordered association rules,
E_1 => E_2 and E_2 => E_3 may not be independent of each other.

Each association rule may satisfy a condition of α ≤ P_KL ≤ 1, wherein P_KL is the
probability that E_K and E_L respectively occur in the two successive time intervals, and wherein α
is a predetermined positive real number satisfying α ≤ 1. The number α may be predetermined to
have a value reflecting the minimum probability P_KL that the user of the method is willing to
accept. For example, the following ranges for α may be considered as acceptable depending on
the application: 0.50 ≤ α ≤ 1, 0.60 ≤ α ≤ 1, 0.70 ≤ α ≤ 1, 0.80 ≤ α ≤ 1, 0.90 ≤ α ≤ 1, and
0.95 ≤ α ≤ 1.

Step 73 generates a plurality of sequences of events. Each sequence of the plurality of
sequences is generated from at least two sequentially ordered association rules of the plurality of
association rules deduced in step 72. Each sequence may be of the form E_1 -> E_2 -> E_3 -> ...
-> E_M -> E_I in relation to sequentially ordered association rules E_1 => E_2, E_2 => E_3, ...,
E_M => E_I of the plurality of association rules deduced in step 72 (I ≥ 3).

Each sequence of events may have a probability of occurrence no less than β, wherein β is
a predetermined positive real number satisfying β ≤ α. See discussion supra of step 72 for a
discussion of α. The number β may be predetermined to have a value reflecting the minimum
probability of occurrence of the sequence of events that the user of the method is willing to
accept. For example, the following ranges for β may be considered as acceptable depending on
the application: 0.20 ≤ β ≤ α, 0.30 ≤ β ≤ α, 0.40 ≤ β ≤ α, 0.50 ≤ β ≤ α, 0.60 ≤ β ≤ α, and
0.70 ≤ β ≤ α.

Step 74 forms a plurality of clusters from the plurality of sequences generated in step 73,
in accordance with a clustering algorithm. Each cluster of the plurality of clusters includes at 
least two sequences of the plurality of sequences generated in step 73. 

Step 75 creates S_c sequences of clusters from the plurality of clusters formed in step 74,
wherein S_c ≥ 1. Each sequence of the S_c sequences includes at least two clusters of the plurality
of clusters formed in step 74. The S_c sequences of clusters so created may be beneficially
utilized in either or both of steps 76 and 77, described infra.

Step 76 uses at least one sequence of the S_c sequences created in step 75 to identify at
least one event occurring in a first time interval of the N time intervals as being a probable cause
of at least one event occurring in a later-occurring time interval of the N time intervals. To
illustrate, in STEP 6 of the example of the preceding section, the execution of the operation
SWA-Op2 by the software A in month 1 was identified as a probable cause of the errors affecting
the memory and the software B in later-occurring month 6.

Step 77 uses a first sequence of the S_c sequences created in step 75 to predict an
occurrence of at least one event in a time interval occurring after the N time intervals, wherein
the at least one event had occurred within the N time intervals. To illustrate, in STEP 7 of the
example of the preceding section, from the fact that event E14 belongs to the cluster C10 whereas
events E15 and E16 each belong to the cluster C12, it was concluded that if during the present
month the event E14 occurs, then it can be predicted that the events E15 and E16 will occur some
time in the next month.

FIG. 17 is a block diagram of a computer system 90 used for associating events, in
accordance with embodiments of the present invention. The computer system 90 comprises a 
processor 91, an input device 92 coupled to the processor 91, an output device 93 coupled to the 
processor 91, and memory devices 94 and 95 each coupled to the processor 91. The input device 
92 may be, inter alia, a keyboard, a mouse, etc. The output device 93 may be, inter alia, a 
printer, a plotter, a computer screen, a magnetic tape, a removable hard disk, a floppy disk, etc. 
The memory devices 94 and 95 may be, inter alia, a hard disk, a floppy disk, a magnetic tape, an 
optical storage such as a compact disc (CD) or a digital video disc (DVD), a dynamic random 
access memory (DRAM), a read-only memory (ROM), etc. The memory device 95 includes a 
computer code 97. The computer code 97 includes an algorithm for associating events. The 
processor 91 executes the computer code 97. The memory device 94 includes input data 96. The 
input data 96 includes input required by the computer code 97. The output device 93 displays 
output from the computer code 97. Either or both memory devices 94 and 95 (or one or more 
additional memory devices not shown in FIG. 17) may be used as a computer usable medium (or
a computer readable medium or a program storage device) having a computer readable program
code embodied therein and/or having other data stored therein, wherein the computer readable
program code comprises the computer code 97. Generally, a computer program product (or,
alternatively, an article of manufacture) of the computer system 90 may comprise said computer
usable medium (or said program storage device).

While FIG. 17 shows the computer system 90 as a particular configuration of hardware 
and software, any configuration of hardware and software, as would be known to a person of 
ordinary skill in the art, may be utilized for the purposes stated supra in conjunction with the 
particular computer system 90 of FIG. 17. For example, the memory devices 94 and 95 may be
portions of a single memory device rather than separate memory devices. 

General Formulation of the Invention 

For the purpose of illustration, consider the e-Commerce environment where event 
notifications are generated by many event sources from various electronic marketplaces at
different points of time. Assume an e-Commerce company, called xyz, decides to take advantage
of the multitude of events that have occurred on the Internet in order to predict the market
behavior. The company xyz may be interested in predicting the market trends and future needs
of its customers. For an e-Commerce entity, customers are a crucial category of event sources.
One of the important events generated by this specific source (i.e., customers) is the creation of a
purchase order. As a consequence of said creation of said purchase order, a notification is sent to
the e-Commerce system that is operated by the company xyz. This notification containing the
detailed information about the purchase order will be stored in the database of the company xyz
for further processing. If the company xyz succeeds in predicting the categories, qualities and
quantities of products that will be in demand with enough confidence, then the company xyz can
request the suppliers of these goods to organize the required logistics ahead of time. Possessing
such predictive capability will be a great asset for the company xyz over its competitors, because 
the company xyz will always have enough goods and better services to satisfy its customers and 
in general will have more useful information about the behaviors of its customers. 

The basic approaches of the present invention to predicting events include: projecting 
occurred events of a specific event source into the future by an event projection algorithm; 
grouping the projected events into several event clusters by an event clustering algorithm; and 
predicting future events of a particular event source and future event classes based on the 
structure of the event clusters, which are next described. 

Events projection includes the projection of the event notifications data generated during
past time intervals [t_0, t_0+Δt], ..., [t_0+(n-1)Δt, t_0+nΔt] to the future time interval
[t_0+nΔt, t_0+(n+1)Δt]. In the e-Commerce example, an events dataset has "events" which are
records of the purchase orders generated by the customers. As stated supra, "dataset" is defined
herein as a collection of data in any known organizational structure and format (e.g., flat file(s),
table(s), a relational database, etc.). It is desired to project the events in the events dataset to
the future in order to predict the purchase orders that will occur with a certain confidence
c > c_0, wherein c_0 is a predetermined confidence threshold. Although there is more than one
solution for this problem, the focus on a solution herein is inspired by the fact that, in general,
an event is not merely an isolated incident, but rather a part of a set of relevant events
associated with each other. The
present invention takes the approach of an association rules domain to discover patterns allowing 
a projection of events with a certain minimum confidence. Association rules are statements of 
the form such as "99% of the French people who like wine also like cheese". 

A formal description of an association rule is as follows. Let F = {e_1, e_2, ..., e_m} be a set of
literals, called items. A set X ⊆ F is called an itemset. A k-itemset is an itemset containing k
items. A set of itemsets is called a database, and each element of this database is called a
transaction. A transaction T contains an itemset X if X ⊆ T. An association rule is an implication
of the form X => Y, where X ⊂ F, Y ⊂ F, and X ∩ Y = ∅. The association rule X => Y holds in the
transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.
The association rule X => Y has support s in the transaction set D if s% of the transactions in D
contain X ∪ Y. Given a database D, the problem of mining association rules is to generate all
association rules that have certain user-specified support (called minsup) and confidence (called
minconf). This problem can be divided into two subproblems: 1) given a specified minimum
support minsup, find all combinations of items that have transaction support greater than minsup
(called large itemsets; all other combinations are called small itemsets); and 2) after finding the
large itemsets, generate the desired association rules. The A Priori algorithm shown in Table 1
is used to find such large itemsets.
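
The definitions of support and confidence given above may be illustrated by the following Python sketch. The function names and the small transaction set D are hypothetical, both quantities are returned as fractions rather than percentages, and the antecedent X is assumed to occur in at least one transaction.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical transaction set D.
D = [frozenset(t) for t in (
    {"wine", "cheese"},
    {"wine", "cheese", "bread"},
    {"wine"},
    {"bread"},
)]

s = support({"wine", "cheese"}, D)        # support of {wine} => {cheese}
c = confidence({"wine"}, {"cheese"}, D)   # confidence of {wine} => {cheese}
```

In this hypothetical set, two of the four transactions contain both items, and two of the three transactions containing wine also contain cheese.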

Table 1. A Priori algorithm

L_1 = {large 1-itemsets};
for ( k = 2; L_{k-1} ≠ ∅; k++ ) do begin
    insert into C_k                          // New candidates
        select p[1], p[2], ..., p[k-1], q[k-1]
        from L_{k-1} p, L_{k-1} q
        where p[1]=q[1], ..., p[k-2]=q[k-2], p[k-1] < q[k-1];
    forall transactions t ∈ D do begin
        C_t = subset(C_k, t);                // Candidates contained in t
        forall candidates c ∈ C_t do
            c.count++;
    end
    L_k = { c ∈ C_k | c.count ≥ minsup }
end
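
For illustration, the algorithm of Table 1 may be sketched in Python as follows. This is a simplified sketch under stated assumptions rather than a definitive implementation: candidates are matched against transactions by a linear scan instead of a hash-tree, minsup is taken to be an absolute transaction count, items are assumed to be sortable (e.g., strings), and the function name apriori and the sample transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}      # L_1
    result = dict(large)
    while large:                           # loop while L_{k-1} is non-empty
        previous = sorted(tuple(sorted(s)) for s in large)
        candidates = set()                 # C_k
        for p, q in combinations(previous, 2):
            # Join step: equal on the first k-2 items; since `previous` is
            # sorted, the last item of p precedes the last item of q.
            if p[:-1] == q[:-1]:
                candidates.add(frozenset(p + q[-1:]))
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:                 # candidate contained in t
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup}  # L_k
        result.update(large)
    return result


# Hypothetical transactions.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"},
                {"B", "C"}, {"A", "B", "C"}]
large_itemsets = apriori(transactions, minsup=3)
```

With these hypothetical transactions and minsup = 3, the three single items and the three pairs are large, whereas {A, B, C} occurs only twice and is discarded.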



The preceding algorithm in Table 1 generates the result ∪_k L_k, wherein both L_k and C_k 
are sets of itemsets each containing k items. L_k is a set of large k-itemsets. Each member of this 
set has two fields: i) the itemset and ii) the support count. C_k is a set of candidate k-itemsets 
(potentially large itemsets). Candidate itemsets C_k are stored in a hash-tree. A node of the 
hash-tree either contains a list of itemsets (a leaf node) or a hash table (an interior node). In an 
interior node, each bucket of the hash table points to another node. The root of the hash-tree is 
defined to be at depth 1. An interior node at depth d points to nodes at depth d+1. Itemsets are 
stored in the leaves. When an itemset c is added, the procedure starts from the root and goes 
down the tree until a leaf is reached. At an interior node at depth d, it is decided which branch to 
follow by applying a hash function to the d-th item of the itemset, and following the pointer in 
the corresponding bucket. All nodes are initially created as leaf nodes. When the number of 
itemsets in a leaf node exceeds a specified threshold, the leaf node is converted to an interior 
node. Starting from the root node, the subset function finds all the candidates contained in a 
transaction t as follows. If at a leaf, the procedure finds which of the itemsets in the leaf are 
contained in t and adds references to them to the answer set. If at an interior node that has been 
reached by hashing the item i, then the procedure hashes on each item that comes after i in t and 
recursively applies this procedure to the node in the corresponding bucket. For the root node, the 
procedure hashes on every item in t. 
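
The hash-tree and its subset function can be sketched as follows. This is a minimal illustration, not the structure used by any particular implementation: the class name HashTree, the default leaf threshold, the number of buckets, and the use of Python's built-in hash are all assumptions. Itemsets and transactions are assumed to be sorted tuples.

```python
class HashTree:
    """Hash-tree of k-itemsets (sorted tuples); full leaves become interior nodes."""

    def __init__(self, k, threshold=3, buckets=5, depth=1):
        self.k, self.threshold, self.buckets, self.depth = k, threshold, buckets, depth
        self.itemsets = []      # used while this node is a leaf
        self.children = None    # bucket -> child node once this node is interior

    def _bucket(self, item):
        return hash(item) % self.buckets

    def insert(self, itemset):
        if self.children is None:                      # leaf node
            self.itemsets.append(itemset)
            # Convert to an interior node when too full, provided a d-th item exists
            if len(self.itemsets) > self.threshold and self.depth <= self.k:
                pending, self.itemsets, self.children = self.itemsets, [], {}
                for s in pending:
                    self._descend(s)
        else:
            self._descend(itemset)

    def _descend(self, itemset):
        b = self._bucket(itemset[self.depth - 1])      # hash the d-th item
        child = self.children.setdefault(
            b, HashTree(self.k, self.threshold, self.buckets, self.depth + 1))
        child.insert(itemset)

    def subset(self, t, start=0, found=None):
        """Collect all stored itemsets contained in the sorted transaction t."""
        found = set() if found is None else found
        if self.children is None:                      # leaf: test containment
            tset = set(t)
            found.update(s for s in self.itemsets if tset.issuperset(s))
        else:                                          # hash items after the last one
            for i in range(start, len(t)):
                child = self.children.get(self._bucket(t[i]))
                if child is not None:
                    child.subset(t, i + 1, found)
        return found
```

Because two different items may fall into the same bucket, a leaf can be reached by more than one path, which is why the answer is collected in a set and why containment is re-checked at the leaf.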

After finding the large itemsets, the procedure uses the large itemsets to generate the 
desired association rules, using the association rule generation algorithm listed in Table 2. The 
general idea is that if ABCD and AB are large itemsets, then it can be determined if the rule 
AB ⇒ CD holds by computing the ratio r = support(ABCD)/support(AB). The association rule 
holds only if r ≥ minconf. Note that the association rule will have minimum support, because 
ABCD is large. 

Table 2. Rule Generation Algorithm 

forall large k-itemsets l_k, k ≥ 2 do begin 
    H_1 = { consequents of rules from l_k with one item in the consequent }; 
    call ap-genrules(l_k, H_1); 
end 

procedure ap-genrules(l_k: large k-itemset, 
                      H_m: set of m-item consequents) 
    if (k > m + 1) then begin 
        insert into H_{m+1} 
            select p[1], p[2], ..., p[r-1], q[r-1] 
            from H_m p, H_m q // H_m having r members 
            where p[1]=q[1], ..., p[r-2]=q[r-2], p[r-1] < q[r-1]; 
        forall h_{m+1} ∈ H_{m+1} do begin 
            conf = support(l_k)/support(l_k - h_{m+1}); 
            if (conf ≥ minconf) then 
                output the rule (l_k - h_{m+1}) ⇒ h_{m+1} with 
                confidence = conf and support = support(l_k); 
            else delete h_{m+1} from H_{m+1}; 
        end 
        call ap-genrules(l_k, H_{m+1}); 
    end 
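
A minimal Python sketch of the rule generation of Table 2 follows. The names apriori_gen and gen_rules are hypothetical, and the control flow is slightly restructured: each set of consequents is tested before it is extended, so that the one-item consequents are handled by the same code. The sketch assumes that the support dictionary contains every subset of each large itemset, as a complete run of the a priori algorithm would provide.

```python
def apriori_gen(consequents):
    """Join m-item consequents that share their first m-1 items."""
    prev = sorted(tuple(sorted(h)) for h in consequents)
    out = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:
                out.add(frozenset(p + (q[-1],)))
    return out

def gen_rules(support, minconf):
    """support: {frozenset: count}. Returns [(antecedent, consequent, conf)]."""
    rules = []

    def ap_genrules(l, h_m):
        kept = set()
        for h in h_m:
            conf = support[l] / support[l - h]     # support(l_k)/support(l_k - h)
            if conf >= minconf:
                rules.append((l - h, h, conf))
                kept.add(h)                        # failing consequents are dropped
        m = len(next(iter(h_m))) if h_m else 0
        if kept and len(l) > m + 1:                # the k > m + 1 test of Table 2
            ap_genrules(l, apriori_gen(kept))

    for l in support:
        if len(l) >= 2:
            ap_genrules(l, {frozenset([i]) for i in l})
    return rules
```

With supports of 4 for {a} and 3 for {a, b}, the rule a ⇒ b has confidence 3/4 = 0.75 and is output only if minconf does not exceed 0.75.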



A t-bucket is defined as a set B of event notifications fired during a specific time interval 
[t_B, t_B+Δt], which is called a t-interval. Two t-buckets are t-consecutive if their t-intervals have 
the forms [t_B, t_B+Δt] and [t_B+Δt, t_B+2Δt], respectively. Let (B_1, B_2) be a couple of 
t-consecutive t-buckets. Event notifications e_1 ∈ B_1 and e_2 ∈ B_2 are respectively transformed 
to the couples (e_1, 0) and (e_2, 1), wherein the second element of the couple is 0 if the first 
element is an event notification from the older t-bucket B_1, and 1 if the first element is an event 
notification from the newer t-bucket B_2. This transformation is called an EI-transformation 
relative to (B_1, B_2), and the results EIT_{B1,B2}(e_1) = (e_1, 0) and EIT_{B1,B2}(e_2) = (e_2, 1) 
are called e-items. The reciprocal transformation is IET() = EIT^{-1}_{B1,B2}. Let I_1 and I_2 be the 
results of an EI-transformation of two t-consecutive t-buckets B_1 and B_2. I_1 ∪ I_2 is called a 
t-transaction. Note that I_1 ∩ I_2 = ∅. A t-database D is defined as a set of t-transactions. The JY 
algorithm in Table 3, which takes as input a set of past event notifications fired over a relatively 
long period of time, generates rules which can be used to predict event notifications. The 
following notation is used: (1) E is a set of event notifications fired over a period of time 
[t_0, t_0+n·Δt]; (2) B_k is the t-bucket associated with the t-interval [t_0+(k-1)·Δt, t_0+k·Δt], 
k ∈ {1, ..., n}; (3) T_k is the t-transaction I_k ∪ I_k', where I_k and I_k' are the results of the 
EI-transformation of B_k and B_{k+1}. Note that I_k ∩ I_k' = ∅, k ∈ {1, ..., n-1}; (4) D is the set of 
the t-transactions T_k (a t-database), k ∈ {1, ..., n-1}; (5) L is the union of all large itemsets 
generated by applying the a priori algorithm of Table 1 to the database D; and (6) R is the set of 
rules generated by applying the rule generation algorithm of Table 2 to L. 
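
The EI-transformation and the construction of a t-database can be sketched in Python as follows. The function names ei_transform, iet and build_t_database are hypothetical. The sketch assumes that each event notification is a hashable value; repeated instances of the same event would need distinct labels, for example (event, instance number), to survive in a set.

```python
def ei_transform(older, newer):
    """EI-transformation: tag notifications of the older t-bucket with 0
    and notifications of the newer t-bucket with 1, giving one t-transaction."""
    return {(e, 0) for e in older} | {(e, 1) for e in newer}

def iet(e_items):
    """Reciprocal transformation: drop the tag to recover event notifications."""
    return {e for e, _ in e_items}

def build_t_database(buckets):
    """buckets: list [B_1, ..., B_n] of sets. Returns [T_1, ..., T_{n-1}],
    one t-transaction per couple of t-consecutive t-buckets."""
    return [ei_transform(buckets[k], buckets[k + 1])
            for k in range(len(buckets) - 1)]
```

The tags keep the two halves of a t-transaction disjoint even when the same event appears in both t-buckets, which is the property I_k ∩ I_k' = ∅ noted above.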


Table 3. JY Algorithm 

D = new Database(); // create an empty database 
for (k = 1; k < n; k++) do begin 
    T_k = new Transaction(); // create an empty transaction 
    forall event notifications e ∈ B_k ∪ B_{k+1} 
        T_k = T_k ∪ {EIT_{Bk,Bk+1}(e)}; 
    end 
end 
D = ∪_k {T_k}; 
L = Apriori(D); // apply the Apriori algorithm to D 
R = genrules(L); // apply the rule generation algorithm to L 
B_{n+1} = new t-bucket(); // create an empty t-bucket for containing the predicted events 
forall association rules X ⇒ Y in R do begin 
    if X ⊆ B_n then 
        B_{n+1} = B_{n+1} ∪ IET(Y); 
    end 
end 
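
The final loop of Table 3, which fills the predicted t-bucket B_{n+1}, can be sketched as follows. The function name predict_next_bucket is hypothetical. The sketch assumes that the rule set has already been restricted to rules whose antecedent X contains only e-items tagged 0 and whose consequent Y contains only e-items tagged 1; that restriction is an assumption of this sketch and is not spelled out in Table 3.

```python
def predict_next_bucket(rules, current_bucket):
    """rules: iterable of (X, Y), with X a set of e-items tagged 0 and
    Y a set of e-items tagged 1. current_bucket: the set B_n of real
    event notifications. Returns the predicted t-bucket B_{n+1}."""
    current = {(e, 0) for e in current_bucket}
    predicted = set()
    for x, y in rules:
        if set(x) <= current:                  # antecedent observed in B_n
            predicted |= {e for e, tag in y if tag == 1}
    return predicted
```

For example, with the single rule ({("e", 0)}, {("e1", 1), ("e2", 1)}) and a current t-bucket containing e, the predicted t-bucket contains e1 and e2.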



B_{n+1} is generated by the JY algorithm, wherein B_{n+1} is interpreted as the set of the 
predicted event notifications which will occur between t_0+nΔt and t_0+(n+1)Δt. As shown in the 
JY algorithm of Table 3, the a priori and rule generation algorithms of Tables 1 and 2 are used to 
generate predictive rules of the form "if the event e occurs f times between t and t+Δt, then there 
is a probability of 98% that the events e_1, e_2, e_3 will occur f_1, f_2, f_3 times, respectively, 
between t+Δt and t+2Δt". If the first part of this rule is verified in the current t-bucket, then it can 
be "predicted" that the second part will be verified in the next t-bucket with 98% confidence. 
FIG. 1 depicts buckets B_k and B_{k+1}, which respectively include events e and e_1, e_2, e_3 (with 
real event e in bucket B_k and predicted events e_1, e_2, e_3 in bucket B_{k+1} as shown). FIG. 1 
illustrates the association rule e ∈ B_k ⇒ e_1, e_2, e_3 ∈ B_{k+1}, in accordance with embodiments 
of the present invention. 

Note that a large amount of information can be obtained from the different types of 
association rules. For instance, the association rule e ∈ B_k ⇒ e_1, e_{2,1}, e_{2,2}, e_3 ∈ B_{k+1}, 
where e_{2,1} and e_{2,2} are two instances of the same event e_2, means that if the event e occurs 
during the t-interval of B_k, then it can be "predicted" that the event e_2 will occur exactly two 
times during the t-interval of B_{k+1} (the next t-bucket) with a certain confidence. This type of 
association rule provides information about the frequency of the event. Other rules can provide 
information about the relationships between events that are part of the same transaction. In the 
e-Commerce example, the generated rules can be applied to predict, with a certain confidence, the 
number and types of the purchase orders expected in the near future. At a higher level, the 
generated rules can be utilized to "predict" the product categories in high demand and the new 
product types which the customers will be interested in buying, which will be discussed infra. 

Additionally, the JY algorithm can be applied recurrently to predict not only the next 
t-bucket, but also a certain number of future t-buckets as long as the confidence of the prediction 
is above a threshold value. Then, these predictions can be used to define a path of a source of 
events, which is the set of all the events generated by that source of events. A part of this path is 
real and the other part is predicted, as seen in FIG. 2, which depicts sequentially-ordered buckets 
B_n, B_{n+1}, ..., B_{n+p-1}, B_{n+p} (with real event e in bucket B_n and predicted events in buckets 
B_{n+1}, ..., B_{n+p-1}, B_{n+p} as shown), in accordance with embodiments of the present invention. 
In the e-Commerce example, the customer is one instance of an event source. The path of the 
customer would be the set of all the past and predicted purchase orders. This path of the customer 
provides valuable information about the behavior of each individual customer. The combination 
of several customers' paths provides information about the whole direction taken by the market 
(see infra the next section about clustering and predicting event classes). After generating the 
predicted event notifications, the next step is to discover the structure of these objects in order to 
obtain a higher level of projection. 
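
The recurrent application of the prediction step can be sketched as follows. The text states only that prediction continues while the confidence stays above a threshold; how confidences combine across steps is not specified. This sketch therefore makes two loudly labeled assumptions: the confidence of one step is the lowest confidence among the rules that fired, and the confidence of a path is the product of its step confidences. The function name predict_path and the untagged rule format (X, Y, conf) are also hypothetical.

```python
def predict_path(rules, current_bucket, threshold, max_steps=10):
    """rules: list of (X, Y, conf) over untagged events, read as
    "X in one t-bucket implies Y in the next". Returns a list of
    (predicted bucket, path confidence), one entry per future t-bucket."""
    path, bucket, path_conf = [], set(current_bucket), 1.0
    for _ in range(max_steps):
        fired = [(y, c) for x, y, c in rules if set(x) <= bucket]
        if not fired:
            break
        step_conf = min(c for _, c in fired)       # assumption: weakest rule
        if path_conf * step_conf < threshold:      # assumption: product rule
            break
        path_conf *= step_conf
        bucket = set().union(*(y for y, _ in fired))
        path.append((bucket, path_conf))
    return path
```

The returned list corresponds to the predicted part of the path of an event source; the real part is the sequence of observed t-buckets that precedes it.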

Next, the results of the event projection are used to identify the structure of the predicted 
events and event notifications in order to predict event classes. For the e-Commerce example, it 
may be desired to build a higher level of projection by predicting the categories of goods that 
will be in high demand by the customers. Additionally, as a "side effect", the new types of 
products that the customers would like to see in the marketplace may be discovered. To achieve 
this goal, the predicted event notifications are clustered. The result of this classification is a 
number of clusters that will be interpreted as the predicted event classes. 

Clustering may be used for identification of homogeneous groups of objects, based on 
whatever data is available. Consider a set N = {e_1, e_2, ..., e_n} of n event notifications. Cluster 
analysis is the set of operations that provides techniques for subdividing the set N into a certain 
number of classes. A classification may be either: (a) a partition C = {C_1, ..., C_m} of N with a 
suitable number m of nonoverlapping clusters C_1, ..., C_m ⊂ N, or (b) a system of hierarchically 
nested clusters, subclusters, etc., which generates a tree or a dendrogram. The predicted event 
notifications may be classified by going through the following steps: (a) selection of entities to 
cluster; (b) selection of dissimilarity measures; (c) selection of a clustering method; (d) 
determining the number of clusters; and (e) interpretation, testing, and replication. 

(a) For the selection of entities to cluster, the predicted event notifications to be classified 
are selected. At this stage, "ideal types" are placed in the data sets. The data values of an ideal 
type are specified to represent an event notification that would typify the characteristics of a 
cluster suspected to be present in the data. One ideal type would be specified for each 
hypothesized cluster. The presence of a single ideal type in a cluster would suggest an 
interpretation for that cluster. 

(b) The selection of a dissimilarity measure between the event notifications (or event 
instances) is the next step in the clustering process. The dissimilarity measure corresponds to 
the metric within which the clusters are believed to be embedded. That is, the measure should 
reflect those characteristics that are suspected to distinguish the clusters present in the data. In 
most cases, one can compare recovery performance between the Pearson correlation coefficient 
and Euclidean distance or other members of the Minkowski metric family. If a simulation study 
is conducted, the optimal dissimilarity measure will be highly dependent on the nature of the 
generated data. A simulation study can therefore focus on a particular structure of interest. 
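
The dissimilarity measures named in step (b) can be written out in a few lines of Python. This is a sketch under assumptions: event notifications are assumed to have been encoded as equal-length numeric vectors, the function names are hypothetical, and turning the Pearson correlation into a dissimilarity as one minus the correlation is one common convention rather than something the text prescribes.

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance of order p (p=2 is Euclidean, p=1 is city-block)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def pearson_dissimilarity(x, y):
    """1 - Pearson correlation: 0 when the two profiles are perfectly
    correlated. Undefined (division by zero) for a constant vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)
```

The two families can disagree: the vectors (1, 2, 3) and (2, 4, 6) are far apart in Euclidean distance but have a Pearson dissimilarity of zero, because they share the same profile shape. Which behavior is wanted depends on the characteristics suspected to distinguish the clusters.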

(c) The selection of a clustering method may take into account a number of aspects, such 
as the following four aspects. A first aspect is that the method may be designed to recover the 
cluster types suspected to be present in the data. A second aspect is that the clustering method 
may be effective at recovering the structures for which it was designed. A third aspect is that the 
method may be insensitive to the presence of error in the data. A fourth aspect is that computer 
software for carrying out the method may be available. 

(d) Determining the number of clusters in the final solution is a significant problem in an 
applied cluster analysis. Most clustering methods are not designed to determine the number of 
clusters in the data. Rather, in most clustering methods, the user must pre-specify the number of 
clusters, as is the case for hierarchical algorithms. The most active research about selecting the 
number of clusters has involved procedures for hierarchical methods. Frequently, the methods 
are called stopping rules because the procedures indicate where one is to stop in the hierarchical 
solution. If only partial agreement is found among the stopping rules applied, then the procedure 
can opt for the larger number of clusters. 

(e) Interpretation, testing, and replication complete the procedure. The interpretation is 
based within the applied discipline area of the researcher. Hypothesis testing can be conducted to 
determine whether significant cluster structure exists in the partitions found by an algorithm. 
Valid testing procedures generally can be divided into two major categories. The first is external 
criterion analysis and is based on (exogenous) variables not used in the cluster analysis. The 
second approach is called internal criterion analysis and is based on information and variables 
obtained from or used in the clustering. 
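
As one concrete instance of steps (b) through (d), the sketch below partitions event notifications with k-means (Lloyd's algorithm) under squared Euclidean distance. The text does not name a clustering method, so k-means is a stand-in chosen for brevity, and the function name kmeans is hypothetical. The number of clusters k is pre-specified by the caller, and the centers start at the first k points, which is a simplifying assumption that keeps the sketch deterministic.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's k-means on tuples of numbers. Returns (labels, centers)."""
    centers = [tuple(p) for p in points[:k]]   # assumption: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers
```

Each resulting label would then be interpreted as a predicted event class, and an ideal type placed in the data as in step (a) would suggest the meaning of the cluster in which it lands.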

After clustering the predicted events and interpreting these clusters as the predicted 
classes of events, it is desirable to attach to each predicted class a measure of confidence in this 
prediction. This confidence may be utilized to compare the results of several clustering methods 
in order to select the best one. 

Consider an event notification as a set of (key, value) pairs. Each event notification can 
be modeled as a vector having a dimension n. In the space of these vectors, each cluster is 
equivalent to a volume. For instance, this volume could be a sphere in an n-dimensional space. 
Let C_n be a cluster of event instances included in a t-bucket B_n and let V(C_n) be the 
corresponding volume. Note that V(C_n) is independent of B_n and the specific event notifications 
contained in C_n. Let B_{n-1} be the t-antecedent of B_n. The set A(C_n) = V(C_n) ∩ B_{n-1} of event 
notifications is called the antecedent of the cluster C_n, as illustrated in FIG. 3, which depicts 
instances of cluster volumes at successive event notifications in consecutive buckets B_p, B_{p+1}, 
..., B_{n-1}, B_n, in accordance with embodiments of the present invention. Consider the counts 
c_n, c_{n-1}, ..., c_{p+1}, c_p of the real event notifications included in A^1(C_n), ..., A^p(C_n), 
respectively. Then time series techniques can be applied to (c_k), k ∈ {p, p+1, ..., n-1}, in order to 
predict c_n, the count of the event notifications in the cluster C_n. Let c_n' be this predicted 
number. Then |c_n - c_n'|/c_n is a measure of the confidence in the event class prediction. 
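
The measure |c_n - c_n'|/c_n can be illustrated with a short Python sketch. The text says only that time series techniques are applied to the past counts, so the least-squares straight line used here is an assumption chosen for simplicity, and the function names are hypothetical.

```python
def predict_count(history):
    """Extrapolate the next count from [c_p, ..., c_{n-1}] with a
    least-squares straight line (an assumed time series technique)."""
    n = len(history)
    xs = range(n)
    mx, my = (n - 1) / 2.0, sum(history) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, history))
    slope = sxy / sxx if sxx else 0.0
    return my + slope * (n - mx)               # value of the line at position n

def class_prediction_error(c_n, history):
    """|c_n - c_n'| / c_n as defined in the text; c_n must be nonzero."""
    return abs(c_n - predict_count(history)) / c_n
```

With the past counts 2, 4, 6, 8 the fitted line gives c_n' = 10, so a cluster holding 10 event notifications yields a value of 0 and a cluster holding 8 yields 0.25. Under this formula, smaller values indicate closer agreement between the clustered count and the time series prediction.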

While embodiments of the present invention have been described herein for purposes of 
illustration, many modifications and changes will become apparent to those skilled in the art. 
Accordingly, the appended claims are intended to encompass all such modifications and changes 
as fall within the true spirit and scope of this invention. 